Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

the proportion of the reads which contain the adaptor sequences at each position. Known

adaptor sequences and their descriptions are stored in the “adapter_list.txt” file, as shown in

Figure 1.25.

K-mers (subsequences of k bases) are formed from the adaptor sequences in the

“adapter_list.txt” file, and the program then searches the reads for these k-mers to report

the total percentage of reads that contain them. The report may reveal sources of bias

such as contaminating adaptor dimers in the library.
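
The search described above can be sketched in a few lines of Python. This is a minimal illustration, not the program's actual implementation: the function names, the k-mer size of 12, and the choice to accumulate the percentage along the read (each read counted once, from its first adaptor k-mer onward) are assumptions made for the example.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def adaptor_content(reads, adaptor, k=12):
    """Cumulative percentage of reads containing an adaptor k-mer, by position.

    Assumes `reads` is a non-empty list of strings. The value of k and the
    cumulative counting scheme are illustrative assumptions.
    """
    adaptor_kmers = kmers(adaptor, k)
    max_len = max(len(read) for read in reads)
    first_hit = [0] * max_len  # reads whose first adaptor k-mer starts here
    for read in reads:
        for i in range(len(read) - k + 1):
            if read[i:i + k] in adaptor_kmers:
                first_hit[i] += 1
                break  # count each read once, at its first adaptor k-mer
    percentages = []
    running = 0
    for count in first_hit:
        running += count
        percentages.append(100.0 * running / len(reads))
    return percentages


# Toy input: one read ends in the example adaptor string, one does not.
example_reads = ["TTTTAGATCGGAAGAG", "TTTTTTTTTTTTTTTT"]
profile = adaptor_content(example_reads, "AGATCGGAAGAG", k=12)
```

Here `profile` holds one percentage per read position, which is the kind of curve an adaptor content graph displays.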

A warning is raised if any sequence is present in more than 5% of all reads, and a failure

occurs if any sequence is present in more than 10% of all reads.
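
The two thresholds can be expressed as a small helper. The function name and its return labels are hypothetical; only the 5% and 10% cut-offs come from the text, and "more than" is read as a strict comparison.

```python
def adaptor_status(max_percent, warn_at=5.0, fail_at=10.0):
    """Classify the highest adaptor percentage seen across all reads.

    "more than 5%" and "more than 10%" are treated as strict inequalities.
    """
    if max_percent > fail_at:
        return "fail"
    if max_percent > warn_at:
        return "warn"
    return "pass"
```

For example, a library in which 7% of reads contain an adaptor sequence would be classified as "warn" under this sketch.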

Figure 1.26 shows the adaptor content graph for a FASTQ file whose raw reads contain no

adaptor sequences (left) and for a FASTQ file that fails this metric because of significant

Illumina Universal Adaptor content (right).

1.5.12 K-mer Content

The k-mer content graph plots the count of each short nucleotide sequence of length k (default k = 7)

against position in the reads. In a normal library, each k-mer is expected to be represented

evenly across the length of the reads. The k-mer content graph shows the positional profiles

of only the six most significant k-mers (Figure 1.27). K-mers that are significantly

overrepresented at specific positions are reported in a table that includes the k-mer

sequences, counts, p-values, and expected position. Caution is required when the reads are

from RNA-Seq libraries: significant k-mers may derive from highly expressed genes and

hence may be biologically meaningful.
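
The idea of counting k-mers by position and looking for departures from an even spread can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, and the ranking uses a plain observed/expected ratio at the peak position instead of the p-value the program reports.

```python
from collections import defaultdict


def kmer_position_counts(reads, k=7):
    """Count how often each k-mer starts at each read position."""
    counts = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]][i] += 1
    return counts


def most_position_biased(counts, n_positions, top=6):
    """Rank k-mers by peak observed/expected ratio (an assumed, simplified score).

    The expected count per position assumes the k-mer is spread evenly
    over all n_positions start positions.
    """
    scored = []
    for kmer, by_pos in counts.items():
        total = sum(by_pos.values())
        expected = total / n_positions
        peak_pos, peak_count = max(by_pos.items(), key=lambda item: item[1])
        scored.append((peak_count / expected, kmer, peak_pos, total))
    # Highest ratio first; ties go to the more abundant k-mer, then alphabetically.
    scored.sort(key=lambda row: (-row[0], -row[3], row[1]))
    return scored[:top]


# Toy reads that all start with the same 7-mer.
toy_reads = ["ACGTACGTTT", "ACGTACGAAA", "ACGTACGCCC"]
toy_counts = kmer_position_counts(toy_reads, k=7)
top_six = most_position_biased(toy_counts, n_positions=4)
```

Each row of `top_six` is a tuple of ratio, k-mer, peak position, and total count. Note that with so few reads every k-mer seen only once scores as highly as a truly biased one, which is one reason a real analysis relies on a statistical test and a large number of reads.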

FIGURE 1.25 Some known adaptor sequences.

FIGURE 1.26 Adaptor content graphs.